Google Ngrams - English (1 Million Most Common Words) 2grams

by: Google, Inc.

Publication date: 2011-09-27 05:49:03

Usage: Attribution 3.0

This item contains the Google 2gram data for the 1 million most common English words.

Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set).

Each of the numbered links below will directly download a fragment of the given corpus. For instance, the first ten links below collectively comprise the 1-gram (i.e., individual words) counts for English, as collected from Google's scanned books around July 15, 2009. In addition, for each corpus we provide a total counts file, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of n-grams.

Details on the corpus construction can be found in the Science article written by Jean-Baptiste Michel et al. but are abbreviated here. Of note, we report only the n-grams that appeared over 40 times in the whole corpus. Therefore, the sum of the 1-gram occurrences in any given corpus is smaller than the number given in the total counts file.

File format: Each of the numbered files below is zipped tab-separated data. (Yes, we know the files have .csv extensions.) Each line has the following format:
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
As an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate   1978   313    215   85
circumvallate   1979   183    147   77
The first line tells us that in 1978, the word "circumvallate" (which means "surround with a rampart or other fortification", in case you were wondering) occurred 313 times overall, on 215 distinct pages and in 85 distinct books from our sample.
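
As an illustration of the line format above, here is a minimal Python sketch that parses one line into named fields. The names `NgramRecord` and `parse_line` are hypothetical and not part of the dataset or any Google tooling; the sketch assumes each line has exactly five tab-separated fields, as described.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One line of an n-gram file (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    # Split on tabs only, so multi-word n-grams keep their internal spaces.
    ngram, year, matches, pages, volumes = line.rstrip("\r\n").split("\t")
    return NgramRecord(ngram, int(year), int(matches), int(pages), int(volumes))


# The sample line quoted above, with explicit tab separators.
record = parse_line("circumvallate\t1978\t313\t215\t85\n")
print(record.ngram, record.year, record.match_count)
```

Splitting on the tab character rather than on arbitrary whitespace matters for 2-grams and longer, where the ngram field itself contains spaces.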

The format of the total counts file is identical, except that the ngram field is absent: there is only one triplet of values (match_count, page_count, volume_count) per year.
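
A relative frequency can then be sketched as an n-gram's match_count divided by the total match_count for the same year. The Python below is a rough illustration under that assumption; `load_total_counts` and `relative_frequency` are hypothetical names, and the total shown for 1978 is a made-up placeholder, not a value from the real total counts file.

```python
def load_total_counts(lines):
    """Map year -> total match_count, reading lines of the form
    year TAB match_count TAB page_count TAB volume_count."""
    totals = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        year, matches, _pages, _volumes = fields
        totals[int(year)] = int(matches)
    return totals


def relative_frequency(match_count, year, totals):
    # Share of all 1-grams counted in that year that this n-gram accounts for.
    return match_count / totals[year]


# Placeholder totals for illustration only (NOT real corpus figures).
sample_totals = load_total_counts(["1978\t1000000\t5000\t100\n"])
print(relative_frequency(313, 1978, sample_totals))
```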

Here's the 9,000,000th line from file 0 of the English 5-grams (googlebooks-eng-all-5gram-20090715-0.csv.zip):
analysis is often described as  1991  1   1   1
In 1991, the phrase "analysis is often described as" occurred one time (that's the first 1), and on one page (the second 1), and in one book (the third 1).

Inside each file the ngrams are sorted alphabetically and then chronologically. Note that the files themselves aren't ordered with respect to one another. A French two-word phrase starting with 'm' will be in the middle of one of the French 2gram files, but there's no way to know which without checking them all.
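
Because all rows for a given ngram sit next to each other within a file, per-ngram aggregates can be computed in a single streaming pass. The following Python sketch assumes that ordering holds; `totals_per_ngram` is a hypothetical helper name, and the sample rows are the two "circumvallate" lines quoted earlier.

```python
from itertools import groupby


def totals_per_ngram(lines):
    """Yield (ngram, match_count summed over all years) for one file.

    Relies on the rows for each ngram being consecutive, which the
    alphabetical-then-chronological sort order provides.
    """
    rows = (line.rstrip("\r\n").split("\t") for line in lines)
    for ngram, group in groupby(rows, key=lambda row: row[0]):
        yield ngram, sum(int(row[2]) for row in group)


sample_lines = [
    "circumvallate\t1978\t313\t215\t85\n",
    "circumvallate\t1979\t183\t147\t77\n",
]
print(list(totals_per_ngram(sample_lines)))  # [('circumvallate', 496)]
```

Since the files are not ordered relative to one another, locating a particular ngram would mean running a pass like this over every fragment of the corpus in turn.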

If datasets aren't yet complete, that means we're still busy uploading them. They'll be available soon.

Usage: This compilation is licensed under a Creative Commons Attribution 3.0 Unported License.

Access-restricted-item: true

Addeddate: 2011-09-27 05:49:03

Identifier: google_ngrams-eng-1M-2gram

Year: 2011

IN COLLECTIONS

google ngrams

Data Collection

Internet Archive Web Crawls

Web Crawls

Uploaded by underscor on September 27, 2011
